A lower bound on compression of unknown alphabets

Authors

  • Nikola Jevtic
  • Alon Orlitsky
  • Narayana P. Santhanam
Abstract

Many applications call for universal compression of strings over large, possibly infinite, alphabets. However, it has long been known that the resulting redundancy is infinite even for i.i.d. distributions. It was recently shown that the redundancy of the strings’ patterns, which abstract the values of the symbols, retaining only their relative precedence, is sublinear in the blocklength n, hence the per-symbol redundancy diminishes to zero. In this paper we show that pattern redundancy is at least (1.5 log_2 e) n^{1/3} bits. To do so, we construct a generating function whose coefficients lower bound the redundancy, and use Hayman’s saddle-point approximation technique to determine the coefficients’ asymptotic behavior.
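To make the notion of a pattern concrete, here is a minimal Python sketch. It is not taken from the paper; the function name `pattern` is a hypothetical choice for illustration. Each symbol is replaced by the rank of its first appearance, so the symbol values are discarded and only their order of first occurrence is kept.

```python
from typing import Hashable, Iterable, List


def pattern(sequence: Iterable[Hashable]) -> List[int]:
    """Return the pattern of a sequence (illustrative helper, not from the paper).

    The first distinct symbol maps to 1, the second distinct symbol to 2,
    and so on; repeated symbols reuse the index assigned at first occurrence.
    """
    first_seen = {}  # symbol -> index assigned at its first occurrence
    result = []
    for symbol in sequence:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        result.append(first_seen[symbol])
    return result


print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

Two strings over entirely different alphabets, such as "abracadabra" and "xyzxwxvxyzx", share the same pattern, which is why a pattern can be compressed without knowing the alphabet.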

Related resources

Relations Between Greedy and Bit-Optimal LZ77 Encodings

This paper investigates the size in bits of the LZ77 encoding, which is the most popular and efficient variant of the Lempel–Ziv encodings used in data compression. We prove that, for a wide natural class of variable-length encoders for LZ77 phrases, the size of the greedily constructed LZ77 encoding on constant alphabets is within a factor O(log n / log log log n) of the optimal LZ77 encoding, w...

The Smallest Grammar Problem Revisited

In a seminal paper of Charikar et al. on the smallest grammar problem, the authors derive upper and lower bounds on the approximation ratios for several grammar-based compressors, but in all cases there is a gap between the lower and upper bound. Here we close the gaps for LZ78 and BISECTION by showing that the approximation ratio of LZ78 is Θ((n/log n)^{2/3}), whereas the approximation ratio of BISE...

On Large Alphabet Compression

In this report, we present results in Large Alphabet Compression. We first show that the min-max redundancy of standard compression tends towards infinity for sufficiently large alphabets. With this, we motivate two other approaches that are employed in compressing large alphabets, namely pattern and shape compression. We then present upper and lower bounds on the min-max redundancy of the same.

arXiv:cs/0603068v1 [cs.IT] 17 Mar 2006: Universal Lossless Compression with Unknown Alphabets - The Average

Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alp...

Universal Lossless Compression with Unknown Alphabets - The Average Case

Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alp...

Journal:
  • Theor. Comput. Sci.

Volume 332

Publication date: 2005